CS:APP Chapter 4 Computer Architecture Pipelined Implementation Part II Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Overview Make the pipelined processor work! Data Hazards Instruction having register R as source follows shortly after instruction having register R as destination Common condition, don’t want to slow down pipeline Control Hazards Mispredict conditional branch Our design predicts all branches as being taken Naïve pipeline executes two extra instructions Getting return address for ret instruction Naïve pipeline executes three extra instructions Making Sure It Really Works –2– What if multiple special cases happen simultaneously? CS:APP W_icode, W_valM Pipeline Stages W_valE, W_valM, W_dstE, W_dstM W valM Fetch Memory Data Data memory memory M_icode, M_Bch, M_valA Addr, Data Select current PC Read instruction Compute incremented PC M Bch valE CC CC Execute ALU ALU aluA, aluB Decode E Read program registers valA, valB Execute d_srcA, d_srcB Decode A B Register Register M file file E Write back Operate ALU Memory D icode, ifun, rA, rB, valC Instruction Instruction memory memory Fetch Write Back –3– valP PC PC increment increment predPC Read or write data memory PC valP f_PC F Update register file CS:APP Write back PIPE- Hardware W icode valE Mem. control write Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages Data Data memory memory Memory data in Addr M_valA M_Bch M icode Bch valE valA dstE dstM e_Bch Execute E ALU fun. ALU ALU CC CC icode ifun ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB Select A Decode D Fetch d_rvalA A dstE dstM srcA srcB W_valM B Register Register M file file E e.g., valC passes through decode dstE dstM data out read valM icode ifun rA rB Instruction Instruction memory memory valC W_valE valP PC PC increment increment Predict PC f_PC M_valA Select PC F –4– W_valM predPC CS:APP Data Dependencies: 2 Nop’s # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt 6 7 8 9 10 W Cycle 6 W f%eax R[ %eax] R[ 3 ] f3 • • • D valA f R[ %edx] = 10 valB f R[ %eax] = 0 –5– Error CS:APP Data Dependencies: No Nop # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt 6 7 8 W Cycle 4 M M_valE = 10 M_dstE = %edx E e_valE f 0 + 3 = 3 E_dstE = %eax D valA f R[ %edx] = 0 valB f R[ %eax] = 0 –6– Error CS:APP Stalling for Data Dependencies # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt –7– F 7 8 9 10 11 W If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage CS:APP Write back Stall Condition W icode valE dstE dstM data out read Mem. control Source Registers valM write Memory Data Data memory memory data in Addr M_valA M_Bch srcA and srcB of current instruction in decode stage Destination Registers dstE and dstM fields Instructions in execute, memory, and write-back stages Special Case Don’t stall for register ID 8 M –8– Bch valE valA dstE dstM dstE dstM srcA e_Bch Execute E icode ALU fun. ALU ALU CC CC ifun ALU A ALU B valC valA valB srcB d_srcA d_srcB Select A Decode d_rvalA A dstE dstM srcA srcB W_valM B Register RegisterM file file W_valE E D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC Indicates absence of register operand icode F W_valM predPC CS:APP Detecting Stall Condition # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 6 bubble 0x00e: addl %edx,%eax 0x010: halt F 7 8 9 10 11 W Cycle 6 W W_dstE = %eax W_valE = 3 • • • D –9– srcA = %edx srcB = %eax CS:APP Stalling X3 # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W E M W E M W E M W 0x006: irmovl $3,%eax bubble bubble 6 bubble 0x00c: addl %edx,%eax F 0x00e: halt 7 8 9 10 D D D D E M W F F F F D E M 11 W Cycle 6 W Cycle 5 W_dstE = %eax M Cycle 4 M_dstE = %eax E • • • D E_dstE = %eax D – 10 – srcA = %edx srcB = %eax srcA = %edx srcB = %eax • • • D srcA = %edx srcB = %eax CS:APP What Happens When Stalling? # demo-h0.ys 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Cycle 8 4 5 6 7 Write Back Memory Execute Decode Fetch 0x000: bubble 0x006: irmovl $10,%edx $3,%eax 0x000: bubble 0x006: irmovl $10,%edx $3,%eax 0x006: bubble 0x00c: irmovl addl %edx,%eax $3,%eax 0x00c: halt 0x00e: addl %edx,%eax 0x00e: halt Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through later stages – 11 – CS:APP Implementing Stalling W_dstM W_dstE W icode valE valM dstE dstM valE valA dstE dstM M_dstM M_dstE M icode Bch E_dstM Pipe control logic E_dstE E_bubble E icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode D_stall F_stall D srcA icode ifun F rA rB valC valP predPC Pipeline Control – 12 – Combinational logic detects stall condition Sets mode signals for how pipeline registers should update CS:APP Pipeline Register Modes Input = y Output = x x Normal stall =0 Output = x x stall =1 – 13 – Output = x x stall =0 y _ Rising clock _ Output = x x bubble =0 Input = y Bubble _ Output = y bubble =0 Input = y Stall _ Rising clock _ Rising clock _ n o p Output = nop bubble =1 CS:APP Data Forwarding Naïve Pipeline Register isn’t written until completion of write-back stage Source operands read from register file in decode stage Needs to be in register file at start of stage Observation Value generated in execute or memory stage Trick – 14 – Pass value directly from generating instruction to decode stage Needs to be available at end of decode stage CS:APP Data Forwarding Example # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt – 15 – irmovl in writeback stage Destination value in W pipeline register Forward as valB for decode stage 6 7 8 9 10 W M E D F W M E D W M E W M W Cycle 6 W R[ %eax] f 3 W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax valA f R[ %edx] = 10 valB f W_valE = 3 CS:APP W_icode, W_valM W_valE, W_valM, W_dstE, W_dstM Bypass Paths W_valE W_valM W Decode Stage m_valM Forwarding logic selects Memory valA and valB Normally from register file Forwarding: get valA or valB from later pipeline Execute stage Addr, Data M_valE M Execute: valE Memory: valE, valM Write back: valE, valM e_valE Bch CC CC ALU ALU E_valA, E_valB, E_srcA, E_srcB Forwarding Sources Data Data memory memory M_icode, M_Bch, M_valA E valA, valB Forward d_srcA, d_srcB Decode A B Register Register M file file E Write back – 16 – D valP CS:APP Data Forwarding Example #2 # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Register %edx Generated by ALU during previous cycle Forward from memory as valA Register %eax – 17 – Value just generated by ALU Forward from execute as valB 6 7 8 W Cycle 4 M M_dstE = %edx M_valE = 10 E E_dstE = %eax e_valE f 0 + 3 = 3 D srcA = %edx srcB = %eax valA f M_valE = 10 valB f e_valE = 3 CS:APP Implementing Forwarding W_valE Write back W_valM W icode valE valM dstE dstM data out read m_valM Data Data memory memory Mem. control write Memory data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM srcA Add additional feedback paths from E, M, and W pipeline registers into decode stage Create logic blocks to select from multiple sources for valA and valB in decode stage srcB d_srcA d_srcB dstE dstM srcA Sel+Fwd A Decode Fwd B A W_valM B Register Register M file file W_valE E D icode ifun – 18 – rA rB Instruction Instruction memory valC srcB valP PC PC increment Predict PC CS:APP Implementing Forwarding W_valE W_valM valE valM dstE dstM data out read m_valM Data Data memory memory l write data in Addr M_valA M_valE valE valA dstE dstM e_valE ALU ALU ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE dstM srcA Sel+Fwd A Fwd B A W_valM B Register Register M file file E –valC 19 – srcB valP ## What should be the A value? int new_E_valA = [ # Use incremented PC D_icode in { ICALL, IJXX } : D_valP; # Forward valE from execute d_srcA == E_dstE : e_valE; # Forward valM from memory d_srcA == M_dstM : m_valM; # Forward valE from memory d_srcA == M_dstE : M_valE; # Forward valM from write back d_srcA == W_dstM : W_valM; # Forward valE from write back d_srcA == W_dstE : W_valE; # Use value read from register file 1 : d_rvalA; ]; W_valE CS:APP Limitation of Forwarding # demo-luh.ys 1 2 3 4 5 0x000: irmovl $128,%edx F D E M W F D E M W F D E M W F 0x018: mrmovl 0(%edx),%eax # Load %eax D E M W F D E M W F D E M W F D E M 0x006: irmovl $3,%ecx 0x00c: rmmovl %ecx, 0(%edx) 0x012: irmovl $10,%ebx 6 0x01e: addl %ebx,%eax # Use %eax 0x020: halt Load-use dependency Value needed by end of decode stage in cycle 7 Value read from memory in memory stage of cycle 8 7 8 9 10 Cycle 7 Cycle 8 M M M_dstE = %ebx M_valE = 10 11 W M_dstM = %eax m_valM f M[128] = 3 • • • D valA f M_valE = 10 valB f R[%eax] = 0 – 20 – Error CS:APP Avoiding Load/Use Hazard # demo-luh.ys 1 2 3 4 5 0x000: irmovl $128,%edx F D E M W F D E M W F D E M W F 0x018: mrmovl 0(%edx),%eax # Load %eax D E M W F D E M W E M W D D E M W F F D E M 0x006: irmovl $3,%ecx 0x00c: rmmovl %ecx, 0(%edx) 0x012: irmovl $10,%ebx 6 7 8 bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt Stall using instruction for one cycle Can then pick up loaded value by forwarding from memory stage F 9 10 11 12 W Cycle 8 W W_dstE = %ebx W_valE = 10 M M_dstM = %eax m_valM f M[128] = 3 • • • D – 21 – valA f W_valE = 10 valB f m_valM = 3 CS:APP Data Data memory memory Mem. control write Memory Detecting Load/Use Hazard data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM dstE dstM srcA srcB d_srcA d_srcB Sel +Fwd A Decode D icode rA rB valC srcB Fwd B A ifun srcA W_valM B Register RegisterM file file E W_valE valP Predict PC Condition Fetch Instruction Instruction memory memory Trigger PC PC increment increment f_PC Load/Use Hazard F – 22 – M_valA E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } Select PC W_valM predPC CS:APP Control for Load/Use Hazard # demo-luh.ys 1 2 3 0x000: 0x006: 0x00c: 0x012: 0x018: 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt 6 7 W M E D F W M E D W M E F D F 8 9 10 11 12 W M E D F W M E D W M E W M W Stall instructions in fetch and decode stages Inject bubble into execute stage Condition Load/Use Hazard – 23 – 5 F D E M W stall stall bubble normal normal CS:APP Branch Misprediction Example demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx – 24 – # Not taken # Fall through # Target (Should not execute) # Should not execute # Should not execute Should only execute first 8 instructions CS:APP Handling Misprediction # demo-j.ys 1 2 3 4 5 0x000: xorl %eax,%eax F D E M W 0x002: jne target # Not taken F D E M W F D E M W bubble D E M W 0x007: irmovl $1,%eax # Fall through F D E M W 0x00d: nop F D E M 0x011: t: irmovl $2,%edx # Target bubble 0x017: irmovl $3,%ebx # Target+1 6 7 8 9 10 F W Predict branch as taken Fetch 2 instructions at target Cancel when mispredicted – 25 – Detect branch not-taken in execute stage On following cycle, replace instructions in execute and decode by bubbles No side effects have occurred yet CS:APP W_valM W icode valE valM dstE dstM Detecting Mispredicted Branch data out read Data Data memory memory Mem. control write Memory m_valM data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM dstE dstM srcA srcB d_srcA d_srcB Sel +Fwd A Condition srcA srcB Fwd B Trigger Decode A W_valM B Register RegisterM file file E Mispredicted Branch E_icode = IJXX & !e_Bch D Fetch icode ifun rA rB Instruction Instruction memory memory valC W_valE valP PC PC increment increment Predict PC f_PC – 26 – M_valA Select PC W_valM CS:APP Control for Misprediction # demo-j.ys 1 2 3 4 5 0x000: xorl %eax,%eax F D E M W 0x002: jne target # Not taken F D E M W F D E M W bubble D E M W 0x007: irmovl $1,%eax # Fall through F D E M W 0x00d: nop F D E M 0x011: t: irmovl $2,%edx # Target bubble 0x017: irmovl $3,%ebx # Target+1 Condition F Mispredicted Branch normal – 27 – 6 7 8 9 10 F W D E M W bubble bubble normal normal CS:APP demo-retb.ys Return Example 0x000: 0x006: 0x00b: 0x011: 0x020: 0x020: 0x026: 0x027: 0x02d: 0x033: 0x039: 0x100: 0x100: – 28 – irmovl Stack,%esp call p irmovl $5,%esi halt .pos 0x20 p: irmovl $-1,%edi ret irmovl $1,%eax irmovl $2,%ecx irmovl $3,%edx irmovl $4,%ebx .pos 0x100 Stack: # Initialize stack pointer # Procedure call # Return point # procedure # # # # Should Should Should Should not not not not be be be be executed executed executed executed # Stack: Stack pointer Previously executed three additional instructions CS:APP Correct Return Example # demo-retb 0x026: ret F bubble D E M W F D E M W F D E M W F D E M W F D E M bubble bubble 0x00b: irmovl $5,%esi # Return As ret passes through pipeline, stall at fetch stage W W valM = 0x0b While in decode, execute, and memory stage – 29 – Inject bubble into decode stage Release stall when reach write-back stage • • • F valC f 5 rB f %esi CS:APP Detecting Return M_Bch M icode M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE dstM srcA Sel+Fwd A Decode D – 30 – icode ifun Fwd B A B Register Register M file file E rA rB valC srcB W_valM W_valE valP Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode } CS:APP Control for Return # demo-retb 0x026: ret F bubble D E M W F D E M W F D E M W F D E M W F D E M bubble bubble 0x00b: irmovl $5,%esi # Return Condition Processing ret – 31 – W F D E M W stall bubble normal normal normal CS:APP Special Control Cases Detection Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode } Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } Mispredicted Branch E_icode = IJXX & !e_Bch Action (on next cycle) Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal bubble bubble normal normal Mispredicted Branch normal – 32 – CS:APP Implementing Pipeline Control W icode valE valM dstE dstM valE valA dstE dstM M_icode M icode Bch e_Bch CC CC E_dstM E_icode Pipe control logic E_bubble E icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode srcA D_bubble – 33 – D_stall D F_stall F icode ifun rA rB valC valP predPC Combinational logic generates pipeline control signals Action occurs at start of following cycle CS:APP Initial Version of Pipeline Control bool F_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool D_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }; bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool E_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB}; – 34 – CS:APP Control Combinations Load/use M ret 1 Mispredict M E Load E D Use D JXX ret 2 M M E E D D ret ret 3 M ret ret E bubble bubble D bubble Combination A Combination B Special cases that can arise on same clock cycle Combination A Not-taken branch ret instruction at branch target Combination B – 35 – Instruction that reads from memory to %esp Followed by ret instruction CS:APP A Decode A Control Combination A D M JXX E D ifun rA Fetch rB valC valP Instruction Instruction memory memory PC PC increment increment Predict PC f_PC M_valA E D W_valE E ret 1 Mispredict M icode W_valM B Register RegisterM file file Select PC ret F W_valM predPC Combination A Condition F D E M W stall bubble normal normal normal Mispredicted Branch normal bubble bubble normal normal Combination bubble bubble normal normal Processing ret – 36 – stall Should handle as mispredicted branch Stalls F pipeline register But PC selection logic will be using M_valM anyhow CS:APP Control Combination B ret 1 Load/use M M E Load E D Use D ret Combination B Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall bubble + bubble stall normal normal – 37 – Would attempt to bubble and stall pipeline register D Signaled by processor as pipeline error CS:APP Handling Control Combination B ret 1 Load/use M M E Load E D Use D ret Combination B Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall stall bubble normal normal – 38 – Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle CS:APP Corrected Pipeline Control Logic bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes IRET in { D_icode, E_icode, M_icode # but not condition for a load/use && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB Condition through pipeline } hazard }); F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall stall bubble normal normal – 39 – Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle CS:APP Pipeline Summary Data Hazards Most handled by forwarding No performance penalty Load/use hazard requires one cycle stall Control Hazards Cancel instructions when detect mispredicted branch Two clock cycles wasted Stall fetch stage while ret passes through pipeline Three clock cycles wasted Control Combinations Must analyze carefully First version had subtle bug Only arises with unusual instruction combination – 40 – CS:APP
© Copyright 2024 Paperzz