COSC3330 Computer Architecture
Lecture 11: ILP
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Topic
• ILP

[Slide figure: "ILP role play" — a fetch unit feeding an execution unit with ~5 functional units, issuing 4-5 instructions per clock]

Sequential Program Semantics
• Humans expect "sequential semantics"
  The processor tries to issue an instruction every clock cycle
  There are dependencies, control hazards, and long-latency instructions
• To achieve performance with minimum effort
  Issue more instructions every clock cycle
  E.g., an embedded system can save power by exploiting instruction-level parallelism and decreasing the clock frequency

Scalar Pipeline (Baseline)
• Machine Parallelism = D (= 5)
• Issue Latency (IL) = 1
• Peak IPC = 1
[Pipeline diagram: stages IF, DE, EX, MEM, WB; one instruction enters per execution cycle]

Superpipelined Machine
• 1 major cycle = M minor cycles
• Machine Parallelism = M x D (= 15) per major cycle
• Issue Latency (IL) = 1 minor cycle
• Peak IPC = 1 per minor cycle = M per baseline cycle
• Superpipelined machines are simply deeper pipelines
[Pipeline diagram: each IF/DE/EX/MEM/WB stage subdivided into minor cycles; there is a cost to deeper pipelining]

Superscalar Machine
• Can issue > 1 instruction per cycle in hardware
• Replicate resources, e.g., multiple adders or multi-ported data caches
• Machine Parallelism = S x D (= 10), where S is the superscalar degree
• Issue Latency (IL) = 1
• Peak IPC = 2
[Pipeline diagram: S = 2 instructions per stage per cycle]

Diversified Pipelines
• Separate pipelines for integer, multiply, FPU, load/store

Instruction Level Parallelism (ILP)
• Basic idea
  Execute several instructions in parallel
• We already do pipelining…
  But it can only churn out at best 1 instr/cycle
• We want multiple instr/cycle
  Yes, it gets a bit complicated and we have to add a fan to cool the processor, but it delivers performance (power is another issue)
  That’s how we
got from the 486 (pipelined) to the Pentium and beyond.

Is This Legal?!?
• The ISA defines instruction execution one by one
  I1: ADD R1 = R2 + R3
  • fetch the instruction
  • read R2 and R3
  • do the addition
  • write R1
  • increment PC
  Now repeat for I2
• Darth Sidious: "Begin landing your troops." Nute Gunray: "Ah, my lord, is that... legal?" Darth Sidious: "I will make it legal."

Sure, As Long As We Don’t Get Caught!
• How about pipelining?
  It already breaks the "rules" described on the previous slide:
  we fetch I2 before I1 has finished
• Parallelism exists in that we perform different operations (fetch, decode, …) on several different instructions in parallel
  As mentioned, there is a limit of 1 IPC

What Does It Mean to Not "Get Caught"?
• The program executes correctly
• OK, what’s "correct"?
  As defined by the ISA:
  the same processor state (registers, PC, memory) as if you had executed one-at-a-time
• You can squash instructions that don’t correspond to the "correct" execution

Example: Toll Booth
• Caravanning on a trip, cars A-D must stay in order to prevent losing anyone
• When we get to the toll, everyone gets in the same lane to stay in order
• This works… but it’s slow. Everyone has to wait for D to get through the toll booth
• Instead, go through two at a time (in parallel) in lanes 1 and 2, then merge back into the original order after the booth
  You didn’t see that…

Illusion of Sequentiality
• So long as everything looks OK to the outside world, you can do whatever you want!
  "Outside appearance" = "Architecture" (ISA)
  "Whatever you want" = "Microarchitecture"
  The microarchitecture basically includes everything not explicitly defined in the ISA
  • pipelining, branch prediction, number of instructions issued per cycle, etc.

Back to ILP… But How?
• Simple ILP recipe
  Read and decode a few instructions each cycle
  • can’t execute > 1 IPC if we’re not fetching > 1 IPC
  If instructions are independent, execute them at the same time
  If not, execute them one at a time
Example:
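The simple recipe above — decode a pair of instructions, issue them together only when independent — can be sketched as a few lines of Python. This is a hedged illustration, not any real decoder: the tuple encoding `(dest, src1, src2)` and the helper names `independent`/`dual_issue` are invented for the example.

```python
# Sketch of the "simple ILP recipe": look at the next two decoded
# instructions and pair-issue them only if the second one does not
# depend on the first. Encoding and helper names are invented.

def independent(a, b):
    """True if b can issue alongside a: no RAW (b reads a's dest),
    no WAW (same dest), no WAR (a reads b's dest)."""
    dst_a, srcs_a = a[0], a[1:]
    dst_b, srcs_b = b[0], b[1:]
    raw = dst_a in srcs_b
    waw = dst_a == dst_b
    war = dst_b in srcs_a
    return not (raw or waw or war)

def dual_issue(program):
    """Greedy 2-wide in-order issue; returns the per-cycle groups."""
    cycles, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and independent(program[i], program[i + 1]):
            cycles.append([program[i], program[i + 1]])
            i += 2
        else:
            cycles.append([program[i]])
            i += 1
    return cycles

# Each instruction is (dest, src1, src2).
prog = [("R1", "R2", "R3"),   # ADD R1 = R2 + R3
        ("R4", "R1", "R5"),   # SUB R4 = R1 - R5  (RAW on R1: cannot pair)
        ("R6", "R2", "R7")]   # AND R6 = R2 & R7
print(len(dual_issue(prog)))  # 2 cycles: {ADD}, then {SUB, AND}
```

Note the check is conservative: it rejects WAR/WAW pairs too, even though some real designs tolerate those within one issue group.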
Original Pentium
• Fetch: fetch up to 32 bytes
• Decode1: decode up to 2 instructions
• Decode2 (x2): read operands and check dependencies
• Execute (x2)
• Writeback (x2)

This is "Superscalar"
• A "scalar" CPU executes one instruction at a time
  includes pipelined processors
• A "superscalar" CPU can execute more than one instruction at a time
  e.g., X + Y and W * Z in the same cycle

ILP is Bounded
• For any sequence of instructions, the available parallelism is limited
• Hazards and dependencies are what limit the ILP
  Data dependencies
  Control dependencies
  Memory dependencies

Dependencies
• Data dependencies
  RAW: Read-After-Write (true dependence)
  WAR: Write-After-Read (anti-dependence)
  WAW: Write-After-Write (output dependence)

Data Dependencies
• Register dependencies
  RAW, WAR, WAW, based on register number
  RAW: A: R1 = R2 + R3   B: R4 = R1 * R4   (B reads R1 after A writes it)
  WAR: A: R1 = R3 / R4   B: R3 = R2 * R4   (B writes R3 after A reads it)
  WAW: A: R1 = R2 + R3   B: R1 = R3 * R4   (both write R1)
[Slide tables trace the R1-R4 values for each pair executed in order and reordered, showing wrong results when the dependence is violated]

ILP
• Arrange instructions based on dependencies
• ILP = Number of instructions / Longest Path
  R5 = 8(R6)
  R7 = R5 – R4
  R9 = R7 * R7
  R15 = 16(R6)
  R17 = R15 – R14
  R19 = R15 * R15

Window in Search of ILP
• A window over the first three instructions sees one chain (R5 → R7 → R9): ILP = 1
• A window over the last three sees R15 feeding two independent instructions: ILP = 1.5
• What if the window covers all six? ILP = ?

Window in Search of ILP
  C1: R5 = 8(R6)        R15 = 16(R6)
  C2: R7 = R5 – R4      R17 = R15 – R14      R19 = R15 * R15
  C3: R9 = R7 * R7
• ILP = 6/3 = 2, better than 1 and 1.5
• A larger window gives more opportunities
• Who exploits the instruction window?
• But what limits the window?
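The window slides above compute ILP as instructions divided by the longest true-dependence chain. A minimal sketch of that computation, assuming the slide's six-instruction example encoded as `(dest, [sources])` tuples (an invented encoding; `ilp` and `last_writer` are illustrative names):

```python
# Sketch of ILP = (number of instructions) / (longest RAW chain),
# for the slide's six-instruction example. Encoding is invented.

def ilp(instrs):
    """instrs: list of (dest, [sources]). Returns instrs / critical path."""
    depth = {}          # instruction index -> length of chain ending there
    last_writer = {}    # register -> index of its most recent writer
    for i, (dest, srcs) in enumerate(instrs):
        preds = [last_writer[s] for s in srcs if s in last_writer]
        depth[i] = 1 + max((depth[p] for p in preds), default=0)
        last_writer[dest] = i
    return len(instrs) / max(depth.values())

prog = [("R5",  ["R6"]),           # R5  = 8(R6)
        ("R7",  ["R5", "R4"]),     # R7  = R5 - R4
        ("R9",  ["R7"]),           # R9  = R7 * R7
        ("R15", ["R6"]),           # R15 = 16(R6)
        ("R17", ["R15", "R14"]),   # R17 = R15 - R14
        ("R19", ["R15"])]          # R19 = R15 * R15
print(ilp(prog))                   # 6 / 3 = 2.0
```

Running it on only the first three instructions gives 1.0, and on the last three gives 1.5, matching the smaller windows on the slide.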
Scheduling
• The central problem of ILP processing:
  we need to determine when parallelism (independent instructions) exists
• In the Pentium example, the decode stage checks multiple conditions:
  • Is there a data dependency? Does one instruction generate a value needed by the other? Do both instructions write to the same register?
  • Is there a structural dependency? Most CPUs have only one divider, so two divides cannot execute at the same time

Scheduling
• How many instructions are we looking for?
  3-6 is typical today
  A CPU that can ideally* do N instructions per cycle is called "N-way superscalar", "N-issue superscalar", or simply "N-way" or "N-issue"
  • *peak execution bandwidth
• This "N" is also called the "issue width"

Static (In-Order) Scheduling
  Program code
  I1: ADD R1, R2, R3
  I2: SUB R4, R1, R5
  I3: AND R6, R1, R7
  I4: OR  R8, R2, R6
  I5: XOR R10, R2, R11
• Cycle 1: start I1. Can we also start I2? No (I2 needs R1 from I1).
• Cycle 2: start I2. Can we also start I3? Yes. Can we also start I4? No (I4 needs R6 from I3).
• If the next instruction cannot start, stop looking for things to do in this cycle!

Dynamic (Out-of-Order) Scheduling
  Program code
  I1: ADD R1, R2, R3
  I2: SUB R4, R1, R5
  I3: AND R6, R1, R7
  I4: OR  R8, R2, R6
  I5: XOR R10, R2, R11
• Cycle 1: operands ready? I1, I5. Start I1, I5.
• Cycle 2: operands ready? I2, I3. Start I2, I3.
• Window size (W): how many instructions ahead we look. Do not confuse it with "issue width" (N). E.g., a 4-issue out-of-order processor can have a 128-entry window (it can look at up to 128 instructions at a time).

Ordering?
• In the previous example, I5 executed before I2, I3, and I4!
• How do we maintain the illusion of sequentiality?
• Toll-booth analogy: one-at-a-time service takes 45s, because one car hands the toll-booth agent a $100 bill and it takes a while to count the change (30s, versus 5s for each of the others). With a "4-issue" toll booth (lanes L1-L4), out-of-order service takes 30s.

Out-of-Order Execution
• We’re now executing instructions in data-flow order
  Great!
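The in-order and out-of-order policies above differ only in what happens when the next instruction is stalled. A small sketch, assuming 1-cycle latencies, a 2-wide issue, and the slide's I1-I5 program (the `(dest, [sources])` encoding and function names are invented for illustration):

```python
# Sketch contrasting static (in-order) and dynamic (out-of-order) issue
# for the slide's I1-I5 example: 2-wide, 1-cycle latency assumed.

def preds(instrs):
    """RAW predecessors per instruction (unique destinations assumed)."""
    writer = {dest: i for i, (dest, _) in enumerate(instrs)}
    return [{writer[s] for s in srcs if s in writer and writer[s] < i}
            for i, (_, srcs) in enumerate(instrs)]

def schedule(instrs, width=2, in_order=True):
    """Returns per-cycle groups of instruction indices."""
    p = preds(instrs)
    done, todo, cycles = set(), list(range(len(instrs))), []
    while todo:
        group = []
        for i in list(todo):
            if len(group) == width:
                break
            if p[i] <= done:            # all producers already finished
                group.append(i)
            elif in_order:
                break                   # in-order: stop at the first stall
        for i in group:
            todo.remove(i)
        done |= set(group)
        cycles.append(group)
    return cycles

prog = [("R1",  ["R2", "R3"]),    # I1: ADD R1, R2, R3
        ("R4",  ["R1", "R5"]),    # I2: SUB R4, R1, R5
        ("R6",  ["R1", "R7"]),    # I3: AND R6, R1, R7
        ("R8",  ["R2", "R6"]),    # I4: OR  R8, R2, R6
        ("R10", ["R2", "R11"])]   # I5: XOR R10, R2, R11
print(schedule(prog, in_order=True))   # [[0], [1, 2], [3, 4]]
print(schedule(prog, in_order=False))  # [[0, 4], [1, 2], [3]]
```

The out-of-order run reproduces the slide's cycles exactly: I1 and I5 start together, then I2 and I3, with I4 last.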
More performance!
• But the outside world can’t know about this
  We must maintain the illusion of sequentiality

Atom Processor
• 2-issue simultaneous multithreading
• 16-stage in-order pipeline
• two integer ALUs
• no instruction reordering

Intel Quad Core
• 4 cores/chip
• 16 pipeline stages, ~3 GHz
• 4-wide superscalar
• Out of order

Quiz: In-order or Out-of-Order?
• Two-way superscalar in-order RISC processor
• Three in-order cores
• Pentium III: 3-issue, out of order
• Cortex-A8 (iPhone 4): dual-issue, in-order
• Cortex-A9 MPCore (iPhone 4S): out-of-order

Cortex A9
• The A8 has a dual-issue in-order 13-stage integer pipeline. Doubling the issue width increased IPC (instructions per clock), and the deeper pipeline gave it frequency headroom.
• The Cortex A9 goes back down to an 8-stage pipeline. It’s still a dual-issue pipeline, but instructions can execute out of order.

ILP is Bounded
• For any sequence of instructions, the available parallelism is limited
• Hazards/dependencies are what limit the ILP
  Data dependencies
  Control dependencies
  Memory dependencies

RAW Memory Dependency
• RAW (Read-After-Write)
  A writes to a location, B reads from the location, therefore B has a RAW dependency on A
  Also called a "true dependency"
  A: STORE R1, 0[R2]
  B: LOAD R5, 0[R2]
  Instructions executing in the same cycle cannot have a RAW dependency

WAR Memory Dependency
• WAR (Write-After-Read)
  A reads from a location, B writes to the location, therefore B has a WAR dependency on A
  If B executes before A has read its operand, the operand will be lost
  Also called an anti-dependence
  A: LOAD R5, 0[R2]
     ADD R7, R5, R7
  B: STORE R3, 0[R2]

WAW Memory Dependency
• WAW (Write-After-Write)
  A writes to a location, B writes to the same location
  If B writes first, then A writes, the location will end up with the wrong value
  Also called an output-dependence
  A: STORE R1, 0[R2]
  B: STORE R3,
0[R2]
  (A second example interleaves a LOAD R5, 0[R2] between the two stores.)

Memory Location Ambiguity
• When the exact location is not known:
  A: STORE R1, 0[R2]
  B: LOAD R5, 24[R8]
  C: STORE R3, -8[R9]
  A RAW exists if (R2+0) == (R8+24)
  A WAR exists if (R8+24) == (R9–8)
  A WAW exists if (R2+0) == (R9–8)

Memory Dependency
• Ambiguous dependencies also force "sequentiality"
• To increase ILP, we need dynamic memory disambiguation mechanisms that are either safe or recoverable
• The ILP could be 1 or could be 3, depending on the actual dependences:
  i1: load r2, (r12)
  i2: store r7, 24(r20)
  i3: store r1, (0xFF00)

Control Dependencies
• If we have a conditional branch, until we actually know the outcome, all later instructions must wait
  That is, all instructions are control dependent on all earlier branches
  This is true for unconditional branches as well (e.g., we can’t return from a function until we’ve loaded the return address)
      la    $8, array
      beq   $20, $22, L1
      lb    $10, 1($8)
      add   $11, $9, $10
      sb    $11, ($8)
  L1: addiu $8, $8, 4

Pop from the Stack
• C code snippet:
    z = fact(x);
    int fact(int n) {
      if (n < 1) return 1;
      else return n * fact(n-1);
    }
• Stack frame: return address at 4($sp), $a0 (= x) at 0($sp)
• MIPS snippet:
          li   $v0, 1
          j    fact
    fact: blt  $a0, $v0, return
          sub  $sp, $sp, 8
          sw   $ra, 4($sp)
          sw   $a0, 0($sp)
          sub  $a0, $a0, 1
          jal  fact
          lw   $a0, 0($sp)
          lw   $ra, 4($sp)
          mult $a0, $v0
          mflo $v0
          add  $sp, $sp, 8
  return: jr   $ra

Name Dependency
• WAR and WAW result from the reuse of names
    R2 = R1 + R3
    R1 = R5 – R7
    R2 = R1 >> 3
• Would WAR and WAW exist with more registers?
ILP Example
• True dependencies force "sequentiality"; ILP = 3/3 = 1:
  i1: load r2, (r12)
  i2: add r1, r2, 9     (true dependence on i1)
  i3: mul r2, r5, r6    (anti-dependence on i2, output dependence on i1)
• With the false dependencies removed, ILP = 3/2 = 1.5:
  i1: load r2, (r12)
  i2: add r1, r2, 9     (true dependence on i1)
  i3: mul r8, r5, r6    (independent)
  c1: load r2, (r12)    mul r8, r5, r6
  c2: add r1, r2, 9

Eliminating WAR Dependencies
• WAR dependencies come from reusing registers
  A: R1 = R3 / R4
  B: R3 = R2 * R4
• Renaming B’s destination removes the dependence:
  A: R1 = R3 / R4
  B: R5 = R2 * R4
[Slide tables trace the register values: with no dependencies, reordering still produces the correct results]

Eliminating WAW Dependencies
• WAW dependencies are also from reusing registers
  A: R1 = R2 + R3
  B: R1 = R3 * R4
• The same solution works — rename A’s destination:
  A: R5 = R2 + R3
  B: R1 = R3 * R4
[Slide tables trace the register values under reordering]

Another Register Example
• When only 4 registers are available:
  R1 = 8(R0)
  R3 = R1 – 5
  R2 = R1 * R3
  24(R0) = R2
  R1 = 16(R0)
  R3 = R1 – 5
  R2 = R1 * R3
  32(R0) = R2
  ILP = ?
• When more registers (or register renaming) are available, the second half becomes:
  R5 = 16(R0)
  R6 = R5 – 5
  R7 = R5 * R6
  32(R0) = R7
  ILP = ?

Obvious Solution: More Registers
• Add more registers to the ISA? BAD!!!
  Changing the ISA can break binary compatibility
  All code must be recompiled
  Not a scalable solution

Better Solution: Register Renaming
• Give the processor more registers than specified by the ISA
  temporarily map ISA registers ("logical" or "architected" registers) to physical registers to avoid overwrites
• Components:
  mapping mechanism
  physical registers
  • allocated vs.
free registers
  allocation/deallocation mechanism

Register Renaming
• Example
  Program code
  I1: ADD R1, R2, R3
  I2: SUB R2, R1, R6
  I3: AND R6, R11, R7
  I4: OR  R8, R5, R2
  I5: XOR R2, R4, R11
  I3 cannot execute before I2, because I3 would overwrite R6, which I2 reads (WAR)
  I5 cannot go before I2, because I2, when it goes, would overwrite R2 with a stale value (WAW)

Register Renaming
• Solution: give I2 a temporary name/location (e.g., S) for the value it produces
• But I4 uses that value, so we must also change its source to S…
• In fact, all uses of R2 from I3 up to the next instruction that writes R2 must now be changed to S!
• We remove WAW dependences the same way: change R2 in I5 (and subsequent instructions) to T, and R6 in I3 to U
  Renamed code:
  I1: ADD R1, R2, R3
  I2: SUB S, R1, R6
  I3: AND U, R11, R7
  I4: OR  R8, R5, S
  I5: XOR T, R4, R11

Register Renaming
• Implementation
  Space for S, T, U, etc.
  How do we know when to rename a register?
• Simple solution
  Do renaming for every instruction:
  change the name of a register each time we decode an instruction that will write to it.
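That per-instruction rule — every write gets a fresh name, every read uses the newest name for its register — can be sketched in a few lines. This is a hedged illustration of the idea, not any real rename stage: the `(dest, [sources])` encoding and the S1, S2, … naming are invented (the slide renames only the conflicting writes to S, T, U, while this sketch renames every write, per the "simple solution").

```python
# Sketch of renaming at decode: each write takes a fresh name, killing
# WAR and WAW on its destination; reads use the newest name. Only true
# (RAW) dependences survive. Encoding and names are invented.

import itertools

def rename(instrs):
    """instrs: list of (dest, [sources]); returns renamed copies."""
    fresh = (f"S{n}" for n in itertools.count(1))
    current = {}                      # architected reg -> newest name
    out = []
    for dest, srcs in instrs:
        new_srcs = [current.get(s, s) for s in srcs]  # map reads first
        current[dest] = next(fresh)   # fresh destination name
        out.append((current[dest], new_srcs))
    return out

prog = [("R1", ["R2", "R3"]),    # I1: ADD R1, R2, R3
        ("R2", ["R1", "R6"]),    # I2: SUB R2, R1, R6
        ("R6", ["R11", "R7"]),   # I3: AND R6, R11, R7
        ("R8", ["R5", "R2"]),    # I4: OR  R8, R5, R2
        ("R2", ["R4", "R11"])]   # I5: XOR R2, R4, R11
for line in rename(prog):
    print(line)
# I3 now writes S3 (not R6) and I5 writes S5 (not R2): the WAR with I2
# and the WAW between I2 and I5 are gone; I4 correctly reads I2's S2.
```

After renaming, I3 and I5 can run as early as their sources allow, which is exactly the freedom the out-of-order scheduler needs.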
Remember what name we gave it.

Register File Organization
• We need some physical structure to store the register values
  ARF: Architected Register File — the "outside" world sees the ARF
  RAT: Register Alias Table — the mapping mechanism
  PRF: Physical Register File — one physical register per instruction in flight

Putting it all Together
  top: R1 = R2 + R3
       R2 = R4 – R1
       R1 = R3 * R6
       R2 = R1 + R2
       R3 = R1 >> 1
       BNEZ R3, top
  Free pool: X9, X11, X7, X2, X13, X4, X8, X12, X3, X5…
  ARF: R1-R6; PRF: X1-X16; the RAT initially maps each Ri to the ARF’s Ri

Renaming in Action
• Each destination write takes a fresh physical register from the free pool and updates the RAT; each source is replaced by its current RAT mapping
• With the free pool above: R1 = R2 + R3 becomes X9 = R2 + R3, R2 = R4 – R1 becomes X11 = R4 – X9, and so on through the BNEZ

Even Physical Registers are Limited
• We keep using new physical registers
  What happens when we run out?
• There must be a way to "recycle"
• When can we recycle?
  When we have given its value to all instructions that use it as a source operand!
  This is not as easy as it sounds

Instruction Commit (Leaving the Pipe)
• The architected register file contains the "official" processor state
• When an instruction leaves the pipeline, it makes its result "official" by updating the ARF (e.g., T42’s value is copied into R3)
• The ARF now contains the correct value; update the RAT
• T42 is no longer needed: return it to the physical register free pool

Careful with the RAT Update!
• An older writer of R3 commits: update the ARF as usual and deallocate its physical register (T42)
• But don’t touch the RAT! It points at T17 — someone else is now the most recent writer of R3
• At some point in the future, the newer writer of R3 exits; that instruction was the most recent writer, so only then do we update the RAT and deallocate its physical register
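The "Putting it all Together" walk-through can be sketched as a RAT plus a free pool, using the slide's own loop body and pool order. A minimal sketch, assuming the slide's initial mapping of each Ri to itself in the ARF; the function name `rename_one` is invented, and commit/recycling is left out.

```python
# Minimal RAT/free-pool sketch of the "Putting it all Together" slide:
# each destination pops a physical register from the free pool and
# updates the RAT; each source reads its current RAT mapping.

from collections import deque

free_pool = deque(["X9", "X11", "X7", "X2", "X13", "X4", "X8", "X12"])
rat = {f"R{i}": f"R{i}" for i in range(1, 7)}   # initially points at the ARF

def rename_one(dest, srcs):
    mapped = [rat[s] for s in srcs]             # read current mappings first
    rat[dest] = free_pool.popleft()             # then allocate the destination
    return (rat[dest], mapped)

loop = [("R1", ["R2", "R3"]),   # R1 = R2 + R3
        ("R2", ["R4", "R1"]),   # R2 = R4 - R1
        ("R1", ["R3", "R6"]),   # R1 = R3 * R6
        ("R2", ["R1", "R2"]),   # R2 = R1 + R2
        ("R3", ["R1"])]         # R3 = R1 >> 1
for dest, srcs in loop:
    print(rename_one(dest, srcs))
# ('X9',  ['R2', 'R3'])
# ('X11', ['R4', 'X9'])
# ('X7',  ['R3', 'R6'])
# ('X2',  ['X7', 'X11'])
# ('X13', ['X7'])
```

The final BNEZ R3, top would read rat["R3"], i.e. X13. Notice the physical registers only grow — which is exactly why commit must recycle them, as the following slides explain.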