1 2 Pipelining Introduction • Consider a drink bottling plant – Filling the bottle = 3 sec. – Placing the cap = 3 sec. – Labeling = 3 sec. EE 457 Unit 6a • Would you want… – Machine 1 = Does all three (9 secs.), outputs the bottle, repeats… – Machine 2 = Divided into three parts (one for each step) passing bottles between them Basic Pipelining Techniques • Machine _____ offers ability to __________ Filler + Capper + Label (3 + 3 + 3) Place Cap (3 sec) Filler (3 sec) Labeler (3 sec) 3 4 More Pipelining Examples A3 B3 X Y B[3:0] A2 B2 X Y A1 B1 X Y A0 B0 X Y Co FA Ci Co FA Ci Co FA Ci Co FA Ci S S S S Z3 addr A[3:0] i Z2 Z1 5 ns 0 i Z[3:0] Z0 BMEM data i addr AMEM data • Freshman/Sophomore/Junior/Senior – Z[i] = A[i] + B[i] – Delay: 10ns Mem. Access (read or write), 10 ns each FA – Clock cycle time = _____________________________________ addr – Would you buy a combo washer + dryer unit that does both operations in the same tank?? • Consider adding an array of 4-bit numbers: data • Car Assembly Line • Wash/Dry/Fold Summing Elements ZMEM 5 6 Pipelined Adder Pipelining Effects on Clock Period 5 ns AMEM addr i i A[3:0] data 10ns Pipeline Register (Stage Latch) S1/S2 BMEM • Rather than just try to balance delay we could consider making more stages addr B[3:0] A3 B3 A2 B2 A1 B1 A0 B0 A3 B3 A2 B2 A1 B1 A0 B0 data Y FA Ci Co S If we assume that the pipeline registers are ideal (0ns additional delay) we can clock the pipe every __ ns. Speedup = ____ C1 S2/S3 X Co 0 S0 Y FA Ci S C2 S3/S4 S1 X Y FA Ci Co S C3 S4/S5 X Co S5/S6 C4 S2 Y FA Ci S S3 Z3 S2 Z2 Z[3:0] Z0 10 ns 10 ns Ex. 2: Balanced stage delay Clock Period = __ns (____ speedup) 5 ns 5 ns 5 ns 5 ns ZMEM i Z1 Ex. 1: Unbalanced stage delay Clock Period = ___ns – Divide long stage into multiple stages – In Example 3, clock period could be _________________ – Time through the pipeline (latency) is still 20 ns, but we’ve doubled our throughput (1 result every __ ns rather than every 10 or 15 ns) – Note: There is a small time overhead to adding a pipeline register/stage (i.e. can’t go crazy adding stages) X 15 ns addr 10ns Ex. 3: Break long stage into multiple stages Clock period = ___(___ speedup) data 7 8 To Register or Latch? But Can We Latch? • We can latch if we run the latches on opposite phases of the clock or have a so-called _________________ • What should we use for pipeline stages – Registers [edge-sensitive] …or… – Latches [level-sensitive] – Because each latch runs on the opposite phase data can only move one step before being stopped by a latch that is in hold (off) mode • Latches may allow data to _________________ • You may learn more about this in EE577a or EE560 (a technique known as Slack Borrowing & Time Stealing) • Answer: __________________ Φ S2b S3a Latch ~Φ S2a Latch S1b Latch S3 Latch S2 Register or Latch Register or Latch S1 S3b 9 10 Pipelining Introduction Non-Pipelined Execution • Implementation technique that _________execution of multiple instructions at once • Improves ___________ rather an single-instruction execution latency • ______________ stage determines clock cycle time [e.g. a 30 min. wash cycle but 1 hour dry time means _________ per load] • In the case of perfectly balanced stages: Fetch (I-MEM) Reg. Read ALU Op. Data Mem Reg. Write Total Time Load 10 ns 5 ns 10 ns 10 ns 5 ns 40 ns Store R-Type Branch – Speedup = Jump • A 5-stage pipelined CPU may not realize this speedup 5x b/c… – – – – Instruction time The stages may not be perfectly balanced The overhead of filling up the pipe initially The overhead (setup time and clock-to-Q) delay of the stage registers Inability to keep the pipe full due to branches & data hazards 40 ns LW $5,100($2) Fetch Reg ALU Data Reg 40 ns Fetch Reg ALU LW $7,40($6) Data Reg 40 ns Fetch LW $8,24($6) … 3 Instructions = 3*40 ns 11 Pipelined Execution time LW $7,40($6) LW $8,24($6) Fetch Reg 30 ns 40 ns ALU Data Reg Fetch Reg ALU Fetch Reg 50 ns 60 ns 70 ns 80 ns – i.e. Multicycle CPU w/ k steps or single cycle CPU w/ clock cycle k times slower Data Reg ALU Fetch Reg Data Reg ALU • Execute n instructions using a k stage datapath • w/o pipelining: ___________ Data Reg • w/ pipelining: ____________ – 70 ns = ___ ns for 1st instruc + _____ for the remaining instructions to complete • The speedup looks like it is only 120 ns / 70 ns = 1.7x • But consider 1003 instructions: ____________________________ – The overhead of filling the pipeline is ___________ over steady-state execution when the pipeline is full – ___ cycles for 1st instruc. + ____ cycles for n-1 instrucs. – Assumes we keep the pipeline full Decode 10ns Exec. 10ns Mem. 10ns WB 10ns C1 ADD C2 SUB ADD C3 LW SUB ADD C4 SW LW SUB C5 AND SW LW SUB ADD C6 OR AND SW LW SUB C7 XOR OR AND SW LW XOR OR AND SW XOR OR AND XOR OR C8 C9 C10 ADD XOR C11 7 Instrucs. = 11 clocks Pipeline Emptying • Notice that even though the register access only takes 5 ns it is allocated a 10 ns slot in the pipeline • Total time for these 3 pipelined instructions = Fetch 10ns Pipeline Full … 20 ns Pipelined Timing Pipeline Filling LW $5,100($2) 10 ns 12 13 14 Single-Cycle CPU Datapath Designing the Pipelined Datapath Fetch (IF) • To pipeline a datapath in five stages means five instructions will be executing (“in-flight”) during any single clock cycle • Resources cannot be shared between stages because there may always be an instruction wanting to use the resource Decode (ID) Exec. (EX) Mem WB 0 1 A + 4 Sh. Left 2 + PCSrc B Branch RegWrite [25:21] 5 [20:16] 5 Addr. PC [15:11] Instruc. Write Reg. # 0 1 5 Write Data I-Cache [15:0] RegDst MemRead Read Reg. 2 # Read data 1 0 Zero Read data 2 0 ALU – Each stage needs its own resources – The single-cycle CPU datapath also matches this concept of no shared resources – We can simply divide the single-cycle CPU into stages Read Reg. 1 # Res. Read Data 1 Register File 16 Addr. Sign Extend 32 1 Write Data ALUSrc INST[5:0] ALUOp[1:0] D-Cache MemtoReg ALU control MemWrite 14 15 16 Information Flow in a Pipeline • Data or control information should flow only in the forward direction in a linear pipeline – Non-linear pipelines where information is fed back into a previous stage occurs in more complex pipelines such as floating point dividers • The CPU pipeline is like a buffet line or cafeteria where people can not try to revisit a a previous serving station without disrupting the smooth flow of the line ??? Buffet Line Register File • Don’t we have a non-linear flow when we write a value back to the register file? – An instruction in WB is re-using the register file in the ID stage – Actually we are utilizing different ________of the register file • ID stage ___________ register values • WB stage __________ register value – Like a buffet line with _________ at one station IM Reg ALU IM ??? Buffet Line Reg 17 18 Pipelining the Fetch Phase CC2 CC3 CC4 CC5 IM Reg ALU DM Reg IM Reg IM ADD $3,$4,$5 ALU Reg IM CC6 CC7 CC8 Write $5 DM Reg ALU DM Reg ALU DM Reg Fetch 4 A B 0 1 Addr. Instruc. I-Cache Stage Register LW $5,100($2) CC1 • Note that to keep the pipeline full we have to fetch a new instruction every clock cycle • Thus we have to perform PC = PC + 4 every clock cycle • Thus there shall be no pipelining registers in the datapath responsible for PC = PC +4 • Support for branch/jump warrants a lengthy discussion which we will perform later + • Only an issue if WB to same register as being read • Register file can be designed to do “internal forwarding” where the data being written is immediately ______ out as the ___________________ PC Register File Reg Read $5 19 20 Pipeline Packing List Basic 5 Stage Pipeline Op = 35 rs=1 rt=10 immed.=40 SW $15,100($2) Op = 43 rs=2 rt=15 immed.=100 Fetch Addr. Instruc. I-Cache Instruction Register 5 Read Reg. 1 # rt 5 Read Reg. 2 # 5 Write Reg. # rt/rd Write Data Read data 2 Register File 16 Instruc = 32 Read data 1 Sign Extend 32 Sh. Left 2 Mem Zero 0 1 ALU 1 rs B Exec. WB + 0 A + 4 Decode Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register LW $10,40($1) Pipeline Stage Register • PC • Just as when you go on a trip you have to pack everything you need _____________, so in pipelines you have to take all the control and data you will need with you down the pipeline Compute the size of each pipeline register (find the max. info needed for any instruction in each stage) To simplify, just consider LW/SW (Ignore control signals) Pipeline Stage Register • 0 1 21 22 Basic 5 Stage Pipeline 4 Decode 5 Read Reg. 1 # 5 Read Reg. 2 # + PC 1 Addr. Instruc. I-Cache Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend Sh. Left 2 Zero ALU 0 Instruction Register B Mem 0 Res. 1 0 32 WB + A Exec. Addr. Read Data Write Data Pipeline Stage Register Fetch Pipeline Stage Register • There is a bug in the load instruction implementation Which register is written with the data read from memory? We need to preserve the ______________ number by carrying it through the pipeline with us LW $10,40($1) In general this is true for all signals needed later in the pipe SW $15,100($2) Pipeline Stage Register • • • 0 PIPELINE CONTROL 1 D-Cache 1 23 Pipeline Control Overview Stage Control • We will just consider basic (simple) pipeline control and deal with problems related to branch and data hazards later • It is assumed that the PC and pipeline register update on each clock cycle so no separate write enable signals are needed for these registers Read Reg. 1 # 5 Read Reg. 2 # Instruc. I-Cache Instruction Register PC 1 Addr. Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend 32 Zero 0 1 0 1 ALU 0 Sh. Left 2 WB Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register 5 + B Mem + A Exec. Pipeline Stage Register 4 Decode Pipeline Stage Register Fetch 24 0 1 • Instruction Fetch: The control signals to read instruction memory and to write the PC are always asserted, so there is nothing special to control in this pipeline stage • ID/RF: As in the previous stage the same thing happens at every clock cycle so there are no optional control lines to set • Execution: The signals to be set are RegDst, ALUop/Func, and ALUSrc. The signals determine whether the result of the instruction written into the register specified by bits 20-16 (for a load) or 15-11 for an R-format), specify the ALU operation, and determine whether the second operand of the ALU will be a register or a sign-extended immediate • Memory Stage: The control lines set in this stage are Branch, MemRead, and MemWrite. These signals are set for the BEQ, LW, and SW instructions respectively • WriteBack: the two control lines are RegWrite , which writes the chosen register, and MemToReg, which decides between the ALU result or memory value as the write data 25 26 Control Signal Generation • Recall from the Single-Cycle CPU discussion that there is no state machine control, but a simple translator (combinational logic) to translate the 6bit opcode into these 9 control signals • Since the datapaths of the single-cycle and pipelined CPU are essentially the same, so is the control • The main difference is that the control signals are generated in one clock cycle and used in a subsequent cycle (later pipeline stage) • We can produce all our signals in the ________ and use the pipeline registers to store and pass them to the _______________ stage ALU Src ALU Op[1:0] Func[5:0] 1 0 10 … R-format Branch Mem Read Mem Write Reg Write MemtoReg 0 0 1 0 0 LW 0 1 00 X 0 1 0 1 1 SW X 1 00 X 0 0 1 0 X Beq X 0 01 X 0 0 0 0 X [25:0] Reg Dst RegWrite [25:21] 5 [20:16] 5 Addr. [15:11] Instruc. Read Reg. 1 # Read Reg. 2 # Read data 1 Write Reg. # 0 1 5 Read data 2 Write Data I-Cache RegDst [15:0] Instruction ALUOp[1:0] MemtoReg RegDst ALUSrc Control PC • How many control signals are needed in each stage [31:26] Control Signals per Stage 16 Register File Sign Extend 27 28 Basic 5 Stage Pipeline Exercise: 16 Sign Extend 32 1 0 1 Res. Addr. Read Data Write Data 5 0 0 1 1 Addr. Instruc. I-Cache 5 Read Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File D-Cache 16 Sign Extend 32 ALUSrc,RegDst, ALUOp, (Func) Sh. Left 2 Zero 0 1 0 1 Branch, MemRead, MemWrite WB WB Mem WB WB Ex Mem WB Read Reg. 1 # Mem Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register Register File Zero 0 A B PC Read data 2 Control Exec. Pipeline Stage Register Write Data RegWrite, MemToReg Instruction Register I-Cache Instruction Register PC Instruc. Read data 1 Decode => AND $12,$4,$5 ALU 1 Write Reg. # ALU 0 Addr. Fetch 4 Pipeline Stage Register 5 Read Reg. 2 # Sh. Left 2 WB => SUB $11,$2,$3 => ADD $14,$8,$9 RegWrite, MemToReg + Read Reg. 1 # Branch, MemRead, MemWrite – LW $10,40($1) => OR $13,$6,$7 + 5 + B ALUSrc,RegDst, ALUOp, (Func) On copies of this sheet, show this sequence executing on the pipeline: + A Mem Mem WB Control 4 Exec. Pipeline Stage Register Ex Mem WB Decode Pipeline Stage Register Fetch • Pipeline Stage Register • Control is generated in the decode stage and passed along to consuming stages through stage registers 0 1 29 30 Review • Although an instruction can begin at each clock cycle, an individual instruction still takes five clock cycles • Note that it takes four clock cycle before the five-stage pipeline is operating at full efficiency • Register write-back is controlled by the WB stage even though the register file is located in the ID stage; the correct write register ID is carried down the pipeline with the instruction data • When a stage is inactive, the values of the control lines are deasserted (shown as 0's) to prevent anything harmful from occurring • No state machine is needed; sequencing of the control signals follows simply from the pipeline itself (i.e. control signals are produced initially but delayed by the stage registers until the correct stage / clock cycle for application of that signal) ADDITIONAL REFERENCE 31 32 16 Sign Extend 32 1 Addr. Read Data Write Data D-Cache 1 1 Addr. PC Res. 0 Instruc. I-Cache Reg. 1 # 5 Read Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend 32 $t1 # Fetch LW and increment PC Decode instruction and fetch operands Pipeline Stage Register Register File 0 Pipeline Stage Register Instruction Register Read data 2 0 5 Sh. Left 2 Mem Zero 0 1 ALU Write Data Zero ALU Instruc. I-Cache Write Reg. # Read data 1 $s0 # Read B Pipeline Stage Register Read Reg. 2 # A Exec. WB + 5 Sh. Left 2 Decode Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register Fetch + Addr. PC WB + 5 Read Reg. 1 # + 1 Mem 4 A B 0 Exec. Pipeline Stage Register 4 Decode LW $t1,4($s0) machine code Fetch LW $t1,4($s0): Decode Pipeline Stage Register LW $t1,4($s0): Fetch 0 1 33 34 Read data 2 Register File 16 Sign Extend 32 0 Res. 1 5 Read Reg. 2 # Addr. Read Data Write Data 0 1 Addr. Instruc. I-Cache 1 Write Reg. # Write Data Read data 1 Read data 2 Register File D-Cache 16 Sign Extend Mem Sh. Left 2 Zero 0 WB Res. 1 Addr. Read Data Write Data Pipeline Stage Register Write Data Zero Read Reg. 1 # Pipeline Stage Register Write Reg. # 0 Instruction Register I-Cache Instruction Register PC Instruc. 5 Exec. ALU 1 Addr. A B ALU 0 Read data 1 Decode PC Read Reg. 2 # 4 Sh. Left 2 Pipeline Stage Register 5 Fetch + Read Reg. 1 # WB + 5 + B Mem + A Exec. Pipeline Stage Register 4 Decode $t1 # / Offset=0x00000004 / $s0 value Fetch LW $t1,4($s0): Memory $t1 # / Address LW $t1,4($s0): Execute 0 1 D-Cache 32 Add offset 4 to $s0 value Read word from memory 35 36 16 Sign Extend 32 1 Addr. Read Data Write Data 0 0 1 Addr. PC Res. B 5 Read Reg. 1 # 5 Read Reg. 2 # Instruc. I-Cache 1 D-Cache Write Reg. # Write Data Fetch LW Read data 2 Register File 16 Write word to $t1 Read data 1 Sign Extend Sh. Left 2 Mem Zero 0 Res. 1 Addr. Read Data Write Data 0 1 D-Cache 32 Decode instruction and fetch operands WB Pipeline Stage Register Register File 0 $t1 # / Data read from memory Instruction Register Read data 2 A Exec. ALU Write Data Zero ALU Instruc. I-Cache Write Reg. # Read data 1 Pipeline Stage Register Read Reg. 2 # Decode Pipeline Stage Register Fetch + Addr. PC WB + 5 Sh. Left 2 + 5 Read Reg. 1 # + 1 Mem 4 A B 0 Exec. Pipeline Stage Register 4 Decode Instruction Register Fetch LW $t1,4($s0) Pipeline Stage Register LW $t1,4($s0): Writeback Add offset 4 to $s0 Read word Write from memory word to $t1 37 38 16 Sign Extend 1 Addr. Read Data Write Data 0 1 Addr. PC Res. 0 Instruc. I-Cache 1 $t5 # Read 5 5 Reg. 2 # Write Reg. # Write Data Read data 1 Read data 2 Register File D-Cache 32 Sh. Left 2 Reg. 1 # $t6 # Read 16 Sign Extend Pipeline Stage Register Register File 0 Pipeline Stage Register Read data 2 Zero Pipeline Stage Register Instruction Register Write Data Read data 1 B Mem Zero ALU Write Reg. # A Exec. 0 WB + I-Cache Read Reg. 2 # Sh. Left 2 ALU Instruc. 5 Read Reg. 1 # Decode Res. 1 Addr. Read Data Write Data Pipeline Stage Register Fetch + 5 Addr. PC WB + + 1 Mem 4 A B 0 Exec. Pipeline Stage Register 4 Decode ADD $t4,$t5,$t6 machine code Fetch ADD $t4,$t5,$t6: Decode Pipeline Stage Register ADD $t4,$t5,$t6: Fetch 0 1 D-Cache 32 $t4 # Decode instruction and fetch operands Fetch ADD and increment PC 39 40 16 Sign Extend 1 Addr. Read Data Write Data D-Cache 32 Add $t5 + $t6 0 0 1 1 Addr. PC Res. Read Reg. 1 # 5 Read Reg. 2 # Instruc. I-Cache Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend 32 Sh. Left 2 Zero 0 1 WB Res. Addr. Read Data Write Data D-Cache Just pass sum through Pipeline Stage Register Register File 0 Pipeline Stage Register Instruction Register Read data 2 5 Mem ALU Write Data Zero ALU Instruc. I-Cache Write Reg. # Read data 1 B Pipeline Stage Register Read Reg. 2 # A Exec. + 5 Sh. Left 2 Decode $t4 # / Sum of $t5 + $t6 Fetch + Addr. PC WB + 5 Read Reg. 1 # + 1 Mem 4 A B 0 Exec. $t4 # / $t6 value / $t5 value 4 Decode Instruction Register Fetch ADD $t4,$t5,$t6: Memory Pipeline Stage Register ADD $t4,$t5,$t6: Execute 0 1 41 42 Register File 16 Sign Extend 32 0 1 Res. Read Reg. 1 # 5 Read Reg. 2 # Addr. Read Data Write Data 0 1 Addr. Instruc. I-Cache 1 Write Reg. # Write Data Read data 2 Register File D-Cache 16 Write sum to $t4 Read data 1 Sign Extend Zero Res. 1 Addr. Read Data Write Data 0 1 D-Cache 32 Decode instruction and fetch operands Fetch ADD Sh. Left 2 0 WB Pipeline Stage Register Read data 2 0 PC Instruction Register Write Data Zero 5 Mem ALU I-Cache Write Reg. # ALU Instruc. Read data 1 $t4 # / Sum of $t5 + $t6 Read Reg. 2 # A B Pipeline Stage Register 5 Sh. Left 2 Exec. + Read Reg. 1 # Decode Pipeline Stage Register Fetch + Addr. PC WB + 5 + Add $t5 + $t6 Just pass sum through Write sum to $t4 43 44 Basic 5 Stage Pipeline • Control is generated in the decode stage and passed along to consuming stages through stage registers Read Reg. 1 # 5 Read Reg. 2 # 1 Addr. Instruc. I-Cache Instruction Register 0 Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend 32 Zero 0 1 0 1 ALU OLD PIPELINING Sh. Left 2 WB Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register 5 + B Mem + A Exec. Pipeline Stage Register 4 Decode Pipeline Stage Register Fetch PC 1 Mem 4 A B 0 Exec. Pipeline Stage Register 4 Decode Instruction Register Fetch ADD $t4,$t5,$t6 Pipeline Stage Register ADD $t4,$t5,$t6: Writeback 0 1 45 Basic 5 Stage Pipeline • Control is generated in the decode stage and passed along to consuming stages through stage registers Read Reg. 1 # 5 Read Reg. 2 # Instruc. I-Cache Instruction Register PC 1 Addr. Write Reg. # Write Data Read data 1 Read data 2 Register File 16 Sign Extend 32 Zero 0 1 0 1 ALU 0 Sh. Left 2 WB Branch, MemRead, MemWrite Res. Addr. Read Data Write Data D-Cache Pipeline Stage Register 5 + B ALUSrc,RegDst, ALUOp, (Func) WB RegWrite, MemToReg + A Mem Mem WB Control 4 Exec. Pipeline Stage Register Ex Mem WB Decode Pipeline Stage Register Fetch 0 1
© Copyright 2026 Paperzz